1. How many breweries are present in each state?

Colorado and California have the most breweries. Each US region also has one or two states with noticeably more breweries than their neighbors.
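
A minimal sketch of the count, assuming a breweries table with one row per brewery and a `State` column (the toy rows below are made up for illustration and are not the real data):

```r
# Hypothetical stand-in for the breweries table (illustrative rows only).
breweries <- data.frame(
  Brew_ID = 1:6,
  State   = c("CO", "CO", "CA", "CA", "CA", "MN"),
  stringsAsFactors = FALSE
)

# Count breweries per state and sort from most to fewest.
state_counts <- sort(table(breweries$State), decreasing = TRUE)
print(state_counts)
```

The sorted table can be passed straight to `barplot()` to get a per-state chart.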

2. Merge beer data with the breweries data. Print the first six observations and the last six observations to check the merged file.

##   Brew_ID       Name.Brewery        City State     Name.Beer Beer_ID   ABV IBU
## 1       1 NorthGate Brewing  Minneapolis    MN       Pumpion    2689 0.060  38
## 2       1 NorthGate Brewing  Minneapolis    MN    Stronghold    2688 0.060  25
## 3       1 NorthGate Brewing  Minneapolis    MN   Parapet ESB    2687 0.056  47
## 4       1 NorthGate Brewing  Minneapolis    MN  Get Together    2692 0.045  50
## 5       1 NorthGate Brewing  Minneapolis    MN Maggie's Leap    2691 0.049  26
## 6       1 NorthGate Brewing  Minneapolis    MN    Wall's End    2690 0.048  19
##                                 Style Ounces
## 1                         Pumpkin Ale     16
## 2                     American Porter     16
## 3 Extra Special / Strong Bitter (ESB)     16
## 4                        American IPA     16
## 5                  Milk / Sweet Stout     16
## 6                   English Brown Ale     16
##      Brew_ID                  Name.Brewery          City State
## 2405     556         Ukiah Brewing Company         Ukiah    CA
## 2406     557       Butternuts Beer and Ale Garrattsville    NY
## 2407     557       Butternuts Beer and Ale Garrattsville    NY
## 2408     557       Butternuts Beer and Ale Garrattsville    NY
## 2409     557       Butternuts Beer and Ale Garrattsville    NY
## 2410     558 Sleeping Lady Brewing Company     Anchorage    AK
##                      Name.Beer Beer_ID   ABV IBU                   Style Ounces
## 2405             Pilsner Ukiah      98 0.055  NA         German Pilsener     12
## 2406         Porkslap Pale Ale      49 0.043  NA American Pale Ale (APA)     12
## 2407           Snapperhead IPA      51 0.068  NA            American IPA     12
## 2408         Moo Thunder Stout      50 0.049  NA      Milk / Sweet Stout     12
## 2409  Heinnieweisse Weissebier      52 0.049  NA              Hefeweizen     12
## 2410 Urban Wilderness Pale Ale      30 0.049  NA        English Pale Ale     12
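
A merge like the one above could be sketched as follows. The key name `Brewery_id` in the beers table and the `suffixes` argument are assumptions chosen to reproduce the `Name.Brewery` / `Name.Beer` column names in the output, and "Example Brewing" / "Example IPA" are made-up rows:

```r
# Toy versions of the two tables (second brewery and its beer are hypothetical).
breweries <- data.frame(
  Brew_ID = c(1, 2),
  Name    = c("NorthGate Brewing", "Example Brewing"),
  City    = c("Minneapolis", "Boulder"),
  State   = c("MN", "CO"),
  stringsAsFactors = FALSE
)
beers <- data.frame(
  Name       = c("Pumpion", "Stronghold", "Example IPA"),
  Beer_ID    = c(2689, 2688, 1),
  ABV        = c(0.060, 0.060, 0.070),
  IBU        = c(38, 25, NA),
  Brewery_id = c(1, 1, 2),   # assumed name of the foreign key column
  stringsAsFactors = FALSE
)

# Join on the brewery ID; the suffixes disambiguate the two "Name" columns.
merged <- merge(breweries, beers,
                by.x = "Brew_ID", by.y = "Brewery_id",
                suffixes = c(".Brewery", ".Beer"))

head(merged, 6)   # first six rows
tail(merged, 6)   # last six rows
```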

3. Address the missing values in each column.

There are 5 missing values for Style, 62 missing values for ABV, and 1,005 missing values for IBU.
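
Counts like these can be obtained in one pass. The sketch below uses made-up rows and assumes that a missing Style may be stored as an empty string rather than `NA`, so it checks for both:

```r
# Illustrative rows only; the real merged table has 2,410 rows.
beer <- data.frame(
  Style = c("American IPA", "", "Hefeweizen"),
  ABV   = c(0.06, NA, 0.05),
  IBU   = c(NA, NA, 40),
  stringsAsFactors = FALSE
)

# NA count per column.
na_counts <- colSums(is.na(beer))

# Blank strings are not NA, so count them separately for Style.
blank_style <- sum(!is.na(beer$Style) & beer$Style == "")

print(na_counts)
print(blank_style)
```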

Beer Style is a grouping/classification for beers that’s been established by brewers based on brewing traditions and their domain expertise. Beers within an established Beer Style tend to have similar alcohol content and bitterness (low within-group variation), whereas beers that differ in Style tend to have less similar ABV and IBU values (high between-group variation). Based on this knowledge, median ABV and IBU values were calculated for all 100 Beer Styles. Missing ABV and IBU values were generally addressed by replacing NAs with the matching Beer Style’s median value.
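
The style-median replacement can be sketched with base R's `ave()`. The rows are illustrative, and the `IBU1` name simply mirrors the imputed column that appears in later output:

```r
# Illustrative rows; each style has at least one observed IBU here.
beer <- data.frame(
  Style = c("American IPA", "American IPA", "American IPA",
            "Hefeweizen", "Hefeweizen"),
  IBU   = c(60, 70, NA, 15, NA),
  stringsAsFactors = FALSE
)

# Median IBU within each style, ignoring missing values.
style_median <- ave(beer$IBU, beer$Style,
                    FUN = function(x) median(x, na.rm = TRUE))

# Keep observed values; fill NAs with the matching style's median.
beer$IBU1 <- ifelse(is.na(beer$IBU), style_median, beer$IBU)
print(beer)
```

A style with no observed values at all yields an `NA` median under this approach, so those rows stay missing and need a separate rule.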

2 of the 5 missing Style values were imputed based on the individual beer names. “OktoberFiesta” had a beer name, ABV, and IBU that were consistent with the other Oktoberfest style beers. Similarly, “Kilt Lifter Scottish-Style Ale” was consistent with the Scottish Ales style beers. The remaining 3 beers with missing Style values did not have enough information to classify their Style. For those 3, Style was left blank, and the missing IBU and ABV values were set to the overall median values of the entire data set.

55 of the 1,005 missing IBU values could not be imputed this way because those Styles were missing all IBU values. These 55 beers were associated with 10 unique Styles that had no IBU values. Cider, Mead, Shandy, and Rauchbier (smoked beer) styles are all typically made with no hops or bittering of any kind. These missing IBU values were set to 0 based on this domain knowledge. There were 2 exceptions where the product names indicated there may be some hops added, contrary to the style conventions: Cider “Nunica Pine” and Mead “Nectar of Hops”. For these 2, the missing IBU values were set to the overall median IBU value for the entire data set.

Finally, the remaining 11 missing IBU values were set to the overall median IBU value for the entire data set.
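
The fallback rules above can be sketched as a short sequence of assignments. The exact style labels in `no_hop_styles` are assumptions about how the data set spells them, and the rows other than "Nunica Pine" are made up:

```r
beer <- data.frame(
  Name.Beer = c("Dry Cider", "Nunica Pine", "Mystery Beer", "Pale A", "Pale B"),
  Style     = c("Cider", "Cider", "",
                "American Pale Ale (APA)", "American Pale Ale (APA)"),
  IBU       = c(NA, NA, NA, 30, 40),
  stringsAsFactors = FALSE
)

overall_median <- median(beer$IBU, na.rm = TRUE)
no_hop_styles  <- c("Cider", "Mead", "Shandy", "Rauchbier")  # assumed labels
exceptions     <- c("Nunica Pine", "Nectar of Hops")         # hopped by name

beer$IBU1 <- beer$IBU

# Rule 1: unhopped styles get IBU = 0, except the named exceptions.
zero_rows <- is.na(beer$IBU1) &
  beer$Style %in% no_hop_styles &
  !(beer$Name.Beer %in% exceptions)
beer$IBU1[zero_rows] <- 0

# Rule 2: anything still missing gets the overall median.
beer$IBU1[is.na(beer$IBU1)] <- overall_median
print(beer)
```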

All further analyses were done by both excluding the missing values and by using the imputed values.

4. Compute the median alcohol content and international bitterness unit for each state. Plot a bar chart to compare.

Utah has the lowest median ABV, likely due to its strict state alcohol laws and regulations. Kentucky and Washington, DC have the highest median ABV.

Replacing missing values for ABV does not really change the trend. This is not surprising since we’re only missing 2-3% of the data. Delaware does get a notable bump in the trend.

Replacing missing values for IBU makes a big difference! There are big swings in the trends after imputing roughly 40% of the IBU values (1,005 of 2,410 beers) based on beer styles. The Northeastern states have the most notable shifts: Maine, New Hampshire, and Vermont. Some of the lower values for NH are driven by missing data for sour beers, which have very low IBUs. The higher values for VT are driven by hoppier Pale Ales and IPAs that may be part of the trending “Juicy/Hazy” New England style IPAs.
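
A side-by-side comparison of state medians before and after imputation could be sketched like this. The values are invented for illustration, and `ABV1` is assumed to hold the imputed column:

```r
beer <- data.frame(
  State = c("UT", "UT", "KY", "KY", "DE", "DE"),
  ABV   = c(0.040, 0.042, 0.065, NA,    0.050, NA),
  ABV1  = c(0.040, 0.042, 0.065, 0.061, 0.050, 0.070),
  stringsAsFactors = FALSE
)

# Median per state: raw values (NAs dropped) vs. imputed values.
med_raw <- tapply(beer$ABV,  beer$State, median, na.rm = TRUE)
med_imp <- tapply(beer$ABV1, beer$State, median)

comparison <- data.frame(
  State   = names(med_raw),
  raw     = as.numeric(med_raw),
  imputed = as.numeric(med_imp)
)
print(comparison)

# Grouped bar chart: one pair of bars per state.
barplot(t(as.matrix(comparison[, c("raw", "imputed")])),
        beside = TRUE, names.arg = comparison$State,
        legend.text = TRUE, ylab = "Median ABV")
```

States where the two bars differ most are the ones whose medians depend most heavily on the imputation.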

5. Which state has the maximum alcoholic (ABV) beer? Which state has the most bitter (IBU) beer?

Colorado has the maximum alcoholic (ABV) beer of 12.8%, “Lee Hill Series Vol. 5 - Belgian Style Quadrupel Ale” by Upslope Brewing.

Oregon has the most bitter (IBU) beer of 138 IBU, “Bitter Bitch Imperial IPA” by Astoria Brewing Company.

##                 Style Brew_ID            Name.Brewery    City State
## 2170 Quadrupel (Quad)      52 Upslope Brewing Company Boulder    CO
##                                                 Name.Beer Beer_ID   ABV IBU
## 2170 Lee Hill Series Vol. 5 - Belgian Style Quadrupel Ale    2565 0.128  NA
##      Ounces median.ABV count.x  ABV1 median.IBU count.y IBU1
## 2170   19.2      0.099       4 0.128         24       1   24
##                              Style Brew_ID            Name.Brewery    City
## 518 American Double / Imperial IPA     375 Astoria Brewing Company Astoria
##     State                 Name.Beer Beer_ID   ABV IBU Ounces median.ABV count.x
## 518    OR Bitter Bitch Imperial IPA     980 0.082 138     12      0.087     103
##      ABV1 median.IBU count.y IBU1
## 518 0.082         91      75  138
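
Rows like the two above can be pulled out with `which.max()`, which skips `NA` values. The sketch uses three beers from this report's own output:

```r
beer <- data.frame(
  State     = c("CO", "OR", "MN"),
  Name.Beer = c("Lee Hill Series Vol. 5 - Belgian Style Quadrupel Ale",
                "Bitter Bitch Imperial IPA",
                "Pumpion"),
  ABV       = c(0.128, 0.082, 0.060),
  IBU       = c(NA, 138, 38),
  stringsAsFactors = FALSE
)

# which.max() returns the index of the first maximum and ignores NAs.
max_abv <- beer[which.max(beer$ABV), ]
max_ibu <- beer[which.max(beer$IBU), ]

print(max_abv[, c("State", "Name.Beer", "ABV")])
print(max_ibu[, c("State", "Name.Beer", "IBU")])
```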

6. Comment on the summary statistics and distribution of the ABV variable

ABV values range from 0.1% to 12.8% with a median of 5.6% and a mean of 6.0%. The higher mean vs. the median indicates the distribution is right-skewed, and the histogram plot visually confirms this. This right-skew may indicate a shift in the beer market towards “bigger” high alcohol beers.

summary(Beer$ABV1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00100 0.05000 0.05600 0.05972 0.06700 0.12800
Beer %>% ggplot(aes(ABV1*100)) + geom_histogram(fill="darkblue",color="black") + xlab("% Alcohol by Volume (%ABV)") + ggtitle("Distribution of Beer %ABV, Right-Skewed")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
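
The mean-versus-median argument can be backed up with a moment-based skewness coefficient. This is a sketch on a small invented vector, not the real ABV column; a positive value indicates a longer right tail:

```r
# Invented ABV values with a couple of strong beers in the right tail.
abv <- c(0.045, 0.050, 0.052, 0.056, 0.060, 0.067, 0.099, 0.128)

# Sample skewness: third central moment over variance^(3/2).
skewness <- function(x) {
  x  <- x[!is.na(x)]
  d  <- x - mean(x)
  m2 <- mean(d^2)
  m3 <- mean(d^3)
  m3 / m2^1.5
}

skew <- skewness(abv)
cat("mean:", mean(abv), " median:", median(abv), " skewness:", skew, "\n")
```

For the histogram itself, passing an explicit `binwidth` to `geom_histogram()` would silence the `stat_bin()` message shown above.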

7. Is there an apparent relationship between the bitterness of the beer and its alcoholic content? Draw a scatter plot. Make your best judgment of a relationship and EXPLAIN your answer.

There appears to be an approximately linear relationship between %ABV and IBU that may have some curvature. This positively correlated relationship is likely because people like drinking balanced beers. Higher alcohol beers tend to also be maltier/sweeter (higher residual sugar) which balances the high bitterness (high IBUs).

There appears to be a boundary near ABV=10% that most beers don’t cross. This may be due to limitations on product cost, beer yeast survival at higher ABV, or state/federal regulations on beer that is 10% ABV or more.

The imputed IBU values are evident from the vertical bands at IBU=0, IBU=median(IBU), etc.

## Warning: Removed 1005 rows containing non-finite values (stat_sum).
## Warning: Removed 1005 rows containing non-finite values (stat_smooth).

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
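
One way to put numbers on "approximately linear with some curvature" is to compare a straight-line fit against a quadratic fit. This sketch swaps in plain `lm()` models rather than the GAM smoother used for the plots, and the data points are invented:

```r
# Invented (ABV %, IBU) pairs with one missing IBU.
abv <- c(4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 8.0, 9.0, 5.2)
ibu <- c(18,  25,  30,  45,  55,  65,  80,  95,  NA)

# Pearson correlation on complete pairs only.
r <- cor(abv, ibu, use = "complete.obs")

# lm() drops rows with NA by default, so both fits use the same rows.
fit_linear <- lm(ibu ~ abv)
fit_quad   <- lm(ibu ~ abv + I(abv^2))

# F-test: does the squared term improve on the straight line?
cmp <- anova(fit_linear, fit_quad)
print(r)
print(cmp)
```

A small p-value in the comparison table would support the curvature; a large one would say the straight line is adequate.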

8. Budweiser would also like to investigate the difference with respect to IBU and ABV between IPAs (India Pale Ales) and other types of Ale (any beer with “Ale” in its name other than IPA). You decide to use KNN classification to investigate this relationship. Provide statistical evidence one way or the other. You can of course assume your audience is comfortable with percentages. KNN is very easy to understand conceptually.

First, find beers with “IPA” or “Ale” directly in their name for the training set (ex: Ranger IPA). Second, find IPAs and Ales without those identifiers in their names for the test set (ex: .

How well does kNN classify IPAs vs. Ales based on beer name alone? Can it correctly classify IPAs and Ales that don’t have “IPA” or “Ale” in the name?
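
Labeling beers from their names could be sketched with `grepl()`. The patterns are assumptions about what counts as a match; "IPA" is checked first so that a name containing both terms is labeled IPA:

```r
beer_names <- c("Ranger IPA", "Porkslap Pale Ale", "Snapperhead IPA",
                "Moo Thunder Stout", "Urban Wilderness Pale Ale")

# IPA takes priority; names with neither term stay unlabeled (NA).
label <- ifelse(grepl("IPA", beer_names), "IPA",
         ifelse(grepl("Ale", beer_names), "Ale", NA_character_))

print(data.frame(beer_names, label))

# Labeled names could form the training set; unlabeled ones are candidates
# for the test set once their true class is known from the Style column.
in_training <- !is.na(label)
```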

Using predicted values for missing data, the kNN model predicts IPAs vs. Ales reasonably well with 90% accuracy.

There is a trade-off between Sensitivity and Specificity with the kNN model. k=5 gives higher sensitivity (classifies fewer Ales incorrectly as IPAs), but k=11 gives higher specificity (classifies fewer IPAs incorrectly as Ales).
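
A minimal kNN sketch using `knn()` from the `class` package is shown below. The data are invented and well separated, the standardization step and `k = 3` are my own choices rather than the settings used in this report, and "Ale" is treated as the positive class to match the description of sensitivity above:

```r
library(class)  # recommended package that ships with standard R installs

train <- data.frame(
  ABV = c(0.050, 0.052, 0.055, 0.048, 0.068, 0.070, 0.072, 0.065),
  IBU = c(25, 30, 35, 20, 65, 70, 75, 60)
)
train_label <- factor(c("Ale", "Ale", "Ale", "Ale", "IPA", "IPA", "IPA", "IPA"))

test <- data.frame(ABV = c(0.051, 0.069, 0.054, 0.071),
                   IBU = c(28, 68, 32, 72))
test_label <- factor(c("Ale", "IPA", "Ale", "IPA"))

# Standardize with the training mean/sd so IBU does not dominate distances.
ctr <- colMeans(train)
sds <- apply(train, 2, sd)
train_s <- scale(train, center = ctr, scale = sds)
test_s  <- scale(test,  center = ctr, scale = sds)

set.seed(1)  # knn() breaks ties at random
pred <- knn(train_s, test_s, cl = train_label, k = 3)

# Confusion matrix and the metrics discussed above.
cm <- table(Predicted = pred, Actual = test_label)
accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["Ale", "Ale"] / sum(cm[, "Ale"])  # Ales correctly called Ale
specificity <- cm["IPA", "IPA"] / sum(cm[, "IPA"])  # IPAs correctly called IPA
print(cm)
```

Repeating the last few lines over a range of `k` values is one way to see the sensitivity/specificity trade-off directly.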

Now exclude the original missing values and repeat the kNN analysis.

In general, the model with excluded NA values performed worse than the model with predicted values for NA’s.

9. Knock their socks off! Find one other useful inference from the data that you feel Budweiser may be able to find value in. You must convince them why it is important and back up your conviction with appropriate statistical evidence.

What makes a Pale Ale vs. IPA anyway?

IPAs are generally (but not always) hoppier and boozier than Pale Ales. American Double / Imperial IPAs are pushing this trend to new limits. Belgian Strong Pale Ales are outliers and may be more similar to Belgian Strong Ales rather than Pale Ales.

##              Pale Ale             White IPA       Strong Pale Ale 
##                   281                    11                     7 
##                   IPA Double / Imperial IPA 
##                   455                   105

## Warning: Removed 299 rows containing non-finite values (stat_smooth).
## Warning: Removed 299 rows containing missing values (geom_point).

## Warning: Removed 299 rows containing non-finite values (stat_smooth).

## Warning: Removed 299 rows containing missing values (geom_point).

Do people actually like boozier beers?

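A deeper dive using a Kaggle beer-review data set looked at alcohol content against ratings. Yes, people tend to like boozier beers: ABV differs significantly across the five review groups (ANOVA, p < 2e-16), and every pairwise Tukey comparison except group 2 vs. group 1 is significant. The correlation between ABV and overall rating is positive but weak (r = 0.14), so ABV explains only a small share of the variation in ratings.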

## Warning: Removed 67785 rows containing non-finite values (stat_ydensity).

##                         Df  Sum Sq Mean Sq F value Pr(>F)    
## beer_reviews$group       4  144907   36227    6837 <2e-16 ***
## Residuals          1518817 8047779       5                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 67792 observations deleted due to missingness
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = beer_reviews$beer_abv ~ beer_reviews$group)
## 
## $`beer_reviews$group`
##           diff         lwr        upr     p adj
## 2-1 0.02771935 -0.01927193 0.07471063 0.4914803
## 3-1 0.39868193  0.35553895 0.44182491 0.0000000
## 4-1 0.90598781  0.86335694 0.94861869 0.0000000
## 5-1 1.12213589  1.07506996 1.16920183 0.0000000
## 3-2 0.37096258  0.34805164 0.39387352 0.0000000
## 4-2 0.87826846  0.85633708 0.90019984 0.0000000
## 5-2 1.09441654  1.06477205 1.12406104 0.0000000
## 4-3 0.50730588  0.49572476 0.51888700 0.0000000
## 5-3 0.72345396  0.70039029 0.74651764 0.0000000
## 5-4 0.21614808  0.19405719 0.23823897 0.0000000

## 
##  Pearson's product-moment correlation
## 
## data:  beer_reviews$beer_abv and beer_reviews$review_overall
## t = 172.34, df = 1518820, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1369288 0.1400485
## sample estimates:
##      cor 
## 0.138489
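
Output of this shape comes from `aov()`, `TukeyHSD()`, and `cor.test()`. The sketch below runs the same sequence on simulated data, assuming (this is not confirmed by the output above) that `group` is a binned version of the overall rating:

```r
set.seed(1)

# Simulated reviews: ABV drifts upward with the rating (assumed structure).
review_overall <- rep(1:5, each = 50)
beer_abv       <- 5 + 0.5 * review_overall + rnorm(250)
reviews <- data.frame(
  review_overall = review_overall,
  group          = factor(review_overall),
  beer_abv       = beer_abv
)

# One-way ANOVA: does mean ABV differ across review groups?
fit <- aov(beer_abv ~ group, data = reviews)
p_value <- summary(fit)[[1]][["Pr(>F)"]][1]

# Pairwise differences with family-wise 95% confidence intervals.
tukey <- TukeyHSD(fit)

# Strength of the linear association between ABV and rating.
ct <- cor.test(reviews$beer_abv, reviews$review_overall)

print(summary(fit))
print(tukey)
print(ct$estimate)
```

With very large samples even a weak correlation produces a tiny p-value, so the size of `ct$estimate` is worth reporting alongside the significance test.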

Conclusion

In conclusion, there is a wealth of information in this data set on Beers, Breweries, States, and beer characteristics (Style, IBU, ABV, etc.):
- There are dominant brewing states in each region of the US.
- Missing IBU data may be important for assessing emerging regional trends.
- Alcohol (ABV) content is shifting higher driven by demand for bigger, boozier beers.
- People generally like drinking beers that are balanced for alcohol/sweetness vs. bitterness.
- Some beer styles can be reliably identified based only on the IBU and ABV content.